- CHARACTER CODING OF JAPANESE
-
- ******************************************************************************
- * This archive contains the kanji font file KDP16SJ.FNT, which is needed *
- * by the KDPLUS kanji preprocessor system. For those who would like to *
- * know how the font file is organized, the following notes have been *
- * provided which explain Japanese character coding. *
- ******************************************************************************
-
- 1) Starting point: the ku-ten table
- All characters used in Japanese writing can be arranged in a table which is
- called the "ku-ten" table. The table, which is universally used, is 94 columns
- wide and 94 rows high, but rows 85 and up are empty (not used) at present.
-
- Numbering of rows and columns starts at 1 (not zero). Any character can be
- identified by specifying its row number (called its "ku" value) and its column
- number (called its "ten" value).
-
- The symbols in rows 1-47 are called "level 1 JIS (Japan Industrial Standard)
- characters"; they are the most commonly used characters. Rows 48 and up are
- called "level 2 JIS". The level 1 kanji (from row 16) are arranged according
- to pronunciation (on-yomi normally) and stroke count.
-
- A print-out of the "ku-ten" table can be found in the instruction manual of
- every Japanese "wapro" (word-processor) and every Japanese printer. In many
- "wapros" the ku-ten values of characters may be entered by hand. "Office
- Automation Dictionaries", available in Japan, enable you to look up the "ku-
- ten value" of any character.
-
- The "ku-ten" table is not completely standardized in Japan. The standardiza-
- tion applies only to rows 1-8 (kana, alphanumerics) and rows 16 and up
- (kanji); they are defined in JIS standard X-0208. Rows 9-15 are left blank in
- the standard and can, apparently, be filled in by manufacturers according to
- their own ideas. The blank areas in rows 1-8 are considered "reserved".
-
- The complete ku-ten table is contained in six files which go with this archive
- (see section 6).
-
- 2) Kanji fonts
- A kanji font is a set of binary data (a ROM chip, or a disk file) describing
- the actual appearance of the symbols. The file KDP16SJ.FNT is an "almost"
- standard 16 x 16 pixel kanji font (see section 7 for a summary of the changes
- which were made). It contains bitmap images of characters, each bitmap 16
- pixels wide and 16 pixels high; each bitmap therefore occupies 32 bytes.
-
- The character bitmaps are arranged sequentially in the font file according to
- the character's position in the ku-ten table. The offset (in bytes) of the
- bitmap corresponding to character [ku,ten] is 32*((ku-1)*94+ten-1). The font
- file contains bit-maps for the first 83 rows of the ku-ten table (rows 85 and
- up are empty anyhow, and row 84 contains only 5 rarely-used characters, so
- this is no great loss). The total number of character images in the font is
- thus 94*83=7802.
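-
- As an illustration, the offset formula above can be written as a small C
- routine. This is only a sketch: the names glyph_offset, read_glyph and the
- KDP_* constants are made up for this example and are not part of KDPLUS.
- It assumes the font file has been opened in binary mode.
-
```c
#include <stdio.h>

#define KDP_COLS        94   /* columns ("ten" values) per row              */
#define KDP_ROWS        83   /* rows ("ku" values) present in the font file */
#define KDP_GLYPH_BYTES 32   /* 16 x 16 pixels at 1 bit per pixel           */

/* Byte offset of the bitmap for [ku,ten], or -1 if outside the font. */
long glyph_offset(int ku, int ten)
{
    if (ku < 1 || ku > KDP_ROWS || ten < 1 || ten > KDP_COLS)
        return -1L;
    return (long)KDP_GLYPH_BYTES * ((long)(ku - 1) * KDP_COLS + (ten - 1));
}

/* Read one 32-byte bitmap into buf. Returns 0 on success, -1 on failure. */
int read_glyph(FILE *font, int ku, int ten, unsigned char buf[KDP_GLYPH_BYTES])
{
    long off = glyph_offset(ku, ten);

    if (off < 0 || fseek(font, off, SEEK_SET) != 0)
        return -1;
    if (fread(buf, 1, KDP_GLYPH_BYTES, font) != (size_t)KDP_GLYPH_BYTES)
        return -1;
    return 0;
}
```
-
- For example, glyph_offset(16,1), the first kanji, gives 32*(15*94) = 45120,
- and the last bitmap in the file, [83,94], starts at offset 32*7801 = 249632.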
-
- The ku-ten table contains many gaps (incompletely filled rows). For instance,
- in row 8 only the first 32 places are filled (with line draw symbols), the
- rest is blank. Row 14 originally contained only 3 symbols (but now we have
- added some IBM control characters to that row). The blank areas are left blank
- in the font file; in other words, they are not skipped, but are represented by
- bit-map tables which consist of zeroes. This is, of course, a waste of space,
- but it makes for flexibility (you can put your own symbols there if you wish)
- and easy decoding.
-
- In the file KDP16SJ.FNT, the bitmap images in rows 9, 10, and 11 use only the
- left-hand half of the 16 x 16 pixel box. They can be displayed with a
- horizontal spacing of 8 pixels. 8-pixel, or half-character, symbols are called
- hankaku; characters which use the full 16 x 16 box are called zenkaku. In a 24
- x 24 pixel font, zenkaku characters are 24 pixels wide and hankaku characters
- are 12 pixels wide (in the font KDP24SJ.FNT, used by KPLJ24, the hankaku
- characters are in fact 13 pixels wide; KPLJ24 inserts 2 empty pixels between
- zenkaku characters to keep the zenkaku spacing twice the size of the hankaku
- spacing).
-
-
- 3) JIS coding
- The number of columns in the ku-ten table, 94, is not arbitrary; it is derived
- from the number of 7-bit ASCII characters. With 7 bits, 128 different
- characters can be represented; leaving out the characters 0 and 127, and also
- the characters 1-32 (control characters and space), we are left with 94
- printable characters, having the numerical values 33-126.
-
- Any character in the ku-ten table can now be represented by 2 bytes:
-
- first byte : "ku" value + 32
- second byte: "ten" value + 32
-
- The first character in the ku-ten table, [ku=1, ten=1] is thus represented by
- the two bytes [33,33]. The first kanji character in the table (the character
- with pronunciation "A", meaning Asia), with ku=16 and ten=1, would be
- represented by the bytes [48,33], or, in ASCII, "0!".
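-
- The rule "add 32 to each value" can be sketched in C as follows. The
- function names kuten_to_jis and jis_to_kuten are invented for this example.
-
```c
/* Convert a ku-ten position to its two 7-bit JIS bytes.
   Returns 0 on success, -1 if ku or ten is outside 1..94. */
int kuten_to_jis(int ku, int ten, unsigned char out[2])
{
    if (ku < 1 || ku > 94 || ten < 1 || ten > 94)
        return -1;
    out[0] = (unsigned char)(ku + 32);    /* first byte : 33..126 */
    out[1] = (unsigned char)(ten + 32);   /* second byte: 33..126 */
    return 0;
}

/* Convert two 7-bit JIS bytes back to a ku-ten position.
   Returns 0 on success, -1 if either byte is outside 33..126. */
int jis_to_kuten(unsigned char b1, unsigned char b2, int *ku, int *ten)
{
    if (b1 < 33 || b1 > 126 || b2 < 33 || b2 > 126)
        return -1;
    *ku  = b1 - 32;
    *ten = b2 - 32;
    return 0;
}
```
-
- With ku=16 and ten=1 the first routine produces the bytes 48 and 33, i.e.
- the ASCII characters "0!" mentioned above.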
-
- Thus we have a system of transmitting Japanese characters on channels which
- use 7-bit characters (especially mainframe systems). This is called the JIS
- code.
-
- The problem which now arises is this: a terminal capable of receiving kanji
- data according to the system described above would interpret each character as
- one half of a kanji. It could not receive normal ASCII text without changing
- it into some garbled mess of kanji and kana. It would, of course, be desirable
- if the same terminal could interpret ASCII characters according to their
- normal meaning ALSO. The solution which was adopted for this may be inelegant,
- but is unavoidable within the limitations of the 7-bit format. It consists of
- switching between two modes: "ASCII mode" and "kanji mode". The mode is
- switched by means of an escape sequence. JIS code systems need two escape
- sequences:
-
- kanji in (KI) sequence: changes from ASCII mode to kanji mode
- kanji out (KO) sequence: changes from kanji mode to ASCII mode
-
- Of course, the disadvantage of this method is that the KI and KO strings may
- become garbled in transmission, leaving the system in the wrong mode. But I
- suppose a better solution wasn't possible in systems using only seven bits.
-
- KI and KO strings differ, according to the "dialect" of the JIS code which is
- in use. Three major dialects are "old JIS", "new JIS", and "NEC", which have
- respectively:
-               KI          KO
-               =======     =======
-     old JIS   ESC $ @     ESC ( H
-     new JIS   ESC $ B     ESC ( J
-     NEC       ESC K       ESC H (pica), ESC E (elite)
-
- "Old JIS" is, for instance, used by JICST and the Nikkei Telecom News data-
- base service. "New JIS" is used by the kanji editor program MOKE (by Mark
- Edwards), and in the Japanese section of the GENIE network. NEC printers use
- the NEC code.
-
- Some JIS systems can also handle hankaku katakana characters. These characters
- are encoded by one byte, with value 21 - 5f hex. To indicate that such codes
- must be interpreted as hankaku katakana rather than normal ASCII, hankaku
- katakana strings must be preceded and followed by special codes:
-
- the character SO (0E hex) switches from ASCII to hankaku katakana;
- the character SI (0F hex) switches from hankaku katakana to ASCII.
-
- This system is used to communicate with the 7-bit, "old JIS" data-bank JICST.
- You initiate a search by typing a keyword in ASCII or hankaku katakana
- (JICST does not accept zenkaku characters for input). The response from the
- system is in ASCII and "old JIS" zenkaku characters.
-
- The default mode for JIS systems is ASCII mode.
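-
- The mode switching can be sketched as a small scanner. The example below is
- a simplification made up for these notes: it recognizes only the "new JIS"
- KI and KO strings from the table above, ignores SO/SI, and merely counts
- characters instead of displaying them. The name scan_new_jis and the enum
- are hypothetical.
-
```c
#include <stddef.h>

enum jis_mode { MODE_ASCII, MODE_KANJI };

/* Scan a 7-bit "new JIS" buffer, counting one-byte (ASCII) and two-byte
   (kanji mode) characters. Returns the mode in effect at the end. */
enum jis_mode scan_new_jis(const unsigned char *s, size_t n,
                           size_t *n_ascii, size_t *n_kanji)
{
    enum jis_mode mode = MODE_ASCII;      /* default mode is ASCII */
    size_t i = 0;

    *n_ascii = 0;
    *n_kanji = 0;
    while (i < n) {
        if (s[i] == 0x1B && i + 2 < n) {  /* ESC plus two more bytes */
            if (s[i + 1] == '$' && s[i + 2] == 'B') {   /* KI */
                mode = MODE_KANJI;
                i += 3;
                continue;
            }
            if (s[i + 1] == '(' && s[i + 2] == 'J') {   /* KO */
                mode = MODE_ASCII;
                i += 3;
                continue;
            }
        }
        if (mode == MODE_KANJI && i + 1 < n) {
            /* s[i] and s[i+1] form one character:
               ku = s[i] - 32, ten = s[i+1] - 32 */
            ++*n_kanji;
            i += 2;
        } else {
            /* one-byte character (a stray last byte in kanji mode
               also ends up here in this sketch) */
            ++*n_ascii;
            i += 1;
        }
    }
    return mode;
}
```
-
- If the returned mode is MODE_KANJI, the KO string was missing or garbled,
- which is exactly the failure described above.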
-
-
- 4) EUC coding
-
- EUC (Extended Unix Code) is a variant of JIS which is used on eight-bit UNIX
- systems such as can be found in university environments. The coding system is
- exactly the same as JIS, but the switch between ASCII mode and Kanji mode is
- not indicated by escape strings. Instead, characters in kanji sequences have
- the high bit set, while ASCII characters have the high bit cleared (zero).
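-
- In other words, the two bytes of a kanji are the JIS bytes with 80 hex
- added. A minimal sketch (the function names are invented for this example,
- and only the two-byte kanji case described above is covered):
-
```c
/* Convert a ku-ten position (1..94 each) to its two EUC bytes:
   the JIS bytes with the high bit set. */
void kuten_to_euc(int ku, int ten, unsigned char out[2])
{
    out[0] = (unsigned char)((ku + 32) | 0x80);    /* same as ku  + 0xA0 */
    out[1] = (unsigned char)((ten + 32) | 0x80);   /* same as ten + 0xA0 */
}

/* Nonzero if the byte belongs to a kanji sequence (high bit set),
   zero if it is a plain ASCII byte. */
int euc_is_kanji_byte(unsigned char c)
{
    return (c & 0x80) != 0;
}
```
-
- The first kanji, [ku=16, ten=1], which is [48,33] in JIS, becomes the two
- bytes B0 A1 hex.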
-
-
- 5) SJIS coding
-
- In bulletin board systems (which are always 8-bit), and frequently also for
- internal character representation in Japanese personal computers, the so-called
- SJIS code is used. SJIS means shift-JIS, probably to indicate that "shifted"
- (high bit set) characters are used. They are used, however, in a way which is
- very different from that of the EUC system.
-
- There are three kinds of SJIS codes: controls, one-byte characters, and two-
- byte characters.
-
- Controls are represented by one byte, having the values 0-1f hex, or 0-31
- decimal. Controls include codes for new line, carriage return, form feed, back
- space, etc.
-
- One byte characters are represented by one byte having a value ranging from 20
- to 7E hex (32 to 126 decimal) or from A0 to DF hex (160 to 223 decimal). For
- values in the range 20 to 7E hex, the meaning of the characters is the same as
- in standard ASCII. The range A1 to DF hex is used for hankaku katakana; these
- values are the same as the JIS hankaku katakana, but with the high bit set. On
- the IBM PC, this range is occupied by the "box draw" characters. The value
- A0 hex represents a space (same as 20 hex).
-
- A peculiarity is that on some systems (for instance the KDPLUS system) the
- one-byte characters can also be coded with two bytes; this is the case when
- the characters have been put somewhere in the non-standardized part of the ku-
- ten table, so that they have a normal two-byte address. On some systems (an
- example is the Ichitaro word-processing system on an AX) ASCII and hankaku
- katakana are kept out of the ku-ten table altogether, so these characters can
- only be selected with one-byte codes.
-
- Two-byte characters are represented by a "high" byte followed by a "low" byte.
- In order not to be mistaken for a control or a one-byte character, the "high"
- byte must use values which are not used by those characters, in the ranges 81-
- 9F hex and E0-EA hex. The "low" byte uses values in the range 40-FC hex, but
- the value 7F hex is skipped (not used). This may be a relic from the paper
- tape era. On paper tape systems, "all holes punched" was never used for a
- character, so that it was possible to erase characters on the tape by
- overpunching them.
-
- There are 188 possible values for the "low" byte and 42 for the "high" byte.
- Every possible value of the "high" byte can now encode 2 rows (2 x 94
- characters) of the "ku-ten" table. In total therefore, 84 rows could be
- encoded, but only one row is encoded for the characters with "high byte" equal
- to EA hex.
-
- The algorithm for converting "ku-ten" values to "high-low" values is:
-
-     high = 0x80 + (ku + 1) / 2;      /* 2 ku values share the same high byte */
-     if (high > 0x9F) high += 0x40;   /* if outside 81-9F range, lift to E0-EA range */
-     if (ku & 1) {                    /* ku is odd */
-         low = 0x3F + ten;
-         if (low >= 0x7F) low++;
-     }
-     else low = 0x9E + ten;           /* ku is even */
-
- The decoding algorithm is equally straightforward: assume that we have already
- determined that a two-byte character has been sent, and we have the "high" and
- "low" bytes available. We calculate the "ku" and "ten" values as follows:
-
-     if (high >= 0xE0) high -= 0x40;
-     high -= 0x80;
-     ku = 2 * high - 1;      /* always produces an odd value */
-
-     if (low > 0x9E) {       /* ku is even: increase the value */
-         ku++;
-         ten = low - 0x9E;
-     }
-     else {                  /* ku is odd */
-         if (low > 0x7F) low--;
-         ten = low - 0x3F;
-     }
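-
- The two algorithms can be checked against each other by wrapping them in
- functions and running every position of the 83 rows through both. The
- function names below are made up for this sketch; the arithmetic is the
- same as in the fragments above.
-
```c
/* ku-ten (ku 1..84, ten 1..94) to SJIS "high" and "low" bytes. */
void kuten_to_sjis(int ku, int ten, int *high, int *low)
{
    int h = 0x80 + (ku + 1) / 2;     /* 2 ku values share one high byte */
    int l;

    if (h > 0x9F) h += 0x40;         /* lift to the E0-EA range */
    if (ku & 1) {                    /* ku is odd */
        l = 0x3F + ten;
        if (l >= 0x7F) l++;          /* 7F hex is not used */
    } else {                         /* ku is even */
        l = 0x9E + ten;
    }
    *high = h;
    *low  = l;
}

/* SJIS "high" and "low" bytes of a two-byte character to ku-ten. */
void sjis_to_kuten(int high, int low, int *ku, int *ten)
{
    if (high >= 0xE0) high -= 0x40;
    high -= 0x80;
    *ku = 2 * high - 1;              /* odd row of the pair */
    if (low > 0x9E) {                /* even row */
        ++*ku;
        *ten = low - 0x9E;
    } else {                         /* odd row */
        if (low > 0x7F) low--;
        *ten = low - 0x3F;
    }
}

/* Encode and decode every position in rows 1..83 and
   return the number of positions that do not survive. */
int sjis_roundtrip_errors(void)
{
    int ku, ten, h, l, k2, t2, errors = 0;

    for (ku = 1; ku <= 83; ku++)
        for (ten = 1; ten <= 94; ten++) {
            kuten_to_sjis(ku, ten, &h, &l);
            sjis_to_kuten(h, l, &k2, &t2);
            if (k2 != ku || t2 != ten) errors++;
        }
    return errors;
}
```
-
- As a spot check, the first kanji [ku=16, ten=1] encodes to high=88 hex,
- low=9F hex.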
-
- The treatment of the one-byte characters depends on where the hankaku
- characters are stored in the font, because this is hardly standardized. In the
- font KDP16SJ.FNT, the hankaku ASCII characters are stored in row 9, and
- hankaku katakana in row 10. So we calculate "ku" and "ten" as follows:
-
-     if (ch < 0x20) {                            /* control character */
-         /*....put appropriate code here....*/
-     }
-     else if ((ch == 0x20) || (ch == 0xA0)) {    /* hankaku space */
-         /* The separate treatment of the hankaku space
-            is necessary, because, inconveniently, the
-            hankaku ASCII row in the font file does not
-            start with a space, but with the exclamation
-            mark (ASCII 0x21). We get the space from row
-            11, which does start with a space. */
-         ku = 11;
-         ten = 1;
-     }
-     else if ((ch > 0x20) && (ch <= 0x7E)) {     /* ASCII */
-         ku = 9;
-         ten = ch - 0x20;
-     }
-     else if ((ch > 0xA0) && (ch <= 0xDF)) {     /* hankaku katakana */
-         ku = 10;
-         ten = ch - 0xA0;
-     }
-     else {                  /* not a one-byte character, but
-                                first half of two-byte character. */
-         /*....put appropriate code here.... */
-     }
-
-
- Of course many tricks can be applied to make the code more compact and faster.
- The separate treatment of the hankaku space can also be avoided with a small
- trick. The above explanation shows the principle, however.
-
- It is quite easy to make your program recognize KI and KO strings, and switch
- automatically between SJIS and JIS coding. It is not so easy to distinguish
- automatically between SJIS and EUC (at least not on the basis of single
- characters).
-
- 6) "Ku-ten" table files
-
- You can obtain a print-out of the ku-ten tables for Level 1 and Level 2 JIS
- by printing the files:
-
- level1.1
- level1.2
- level1.3
-
- (for Level 1 JIS)
-
- level2.1
- level2.2
- level2.3
-
- (for Level 2 JIS)
-
- Because the ku-ten tables are too wide to be printed on one sheet, they have
- been split into three parts, covering the columns 1-32, 33-64, and 65-94
- respectively. You can print all three of them on a Japanese printer or word
- processing system, or on a "Western" printer using the print utilities of the
- KDPLUS system. Glue the tables together to get complete ku-ten tables.
- The tables are SJIS coded. To convert to JIS, use the KDPLUS SJIS2JIS utility.
-
- 7) Changes made in KDP16SJ.FNT
- A few changes have been made in KDP16SJ.FNT to adapt it for use with KDPLUS
- and the KDPLUS editor, JWRITE. The most important of those changes is that the
- IBM control code symbols (corresponding to ASCII values below 32) have been
- added to row 14 of the font, from position 11, and the IBM characters with
- values EB-FE are in that same row from position 75. Furthermore, the character
- ASCII 92 (5C hex), corresponding to ku=9, ten=60, is now displayed
- as a backslash, to make it conform to normal IBM PC usage (in the original
- KDP16SJ.FNT, as on most Japanese computer systems, this character is a "yen"
- sign). Also, some cosmetic changes have been made in the "tilde", apostrophe,
- reverse apostrophe, and quotation mark symbols, to make them useable as
- accents. The "equals" sign (=) has also been slightly modified. In combination
- with the capital Y, it makes a nice "yen" sign (through the accent facility of
- JWRITE), should you need it.
-
- If you don't like these changes, you can undo them using the font editor
- KFEDIT that comes with KDPLUS.
-
-
- Tokyo, 10 July 1991 (revised 14 January 1992, 16 February 1992, 20 May 1992)
- Jan W. Stumpel
-
-